Parallelizing XML Processing Pipelines via MapReduce

Authors

  • Daniel Zinn
  • Sven Köhler
  • Shawn Bowers
  • Bertram Ludäscher
Abstract

We present approaches for exploiting data parallelism in XML processing pipelines through novel compilation strategies to the MapReduce framework. Pipelines in our approach consist of sequences of processing steps that consume XML-structured data and produce, often through calls to “black-box” functions, modified (i.e., updated) XML structures. Our main contributions are a set of strategies for compiling such XML pipelines into parallel MapReduce networks and a discussion of their advantages and tradeoffs. We present a detailed experimental evaluation of these approaches using the Hadoop MapReduce system as our implementation platform. Our results show that execution times of XML pipelines can be significantly reduced using our compilation strategies. These efficiency gains, together with the benefits of MapReduce (e.g., fault tolerance), make our approach ideal for executing large-scale, compute-intensive XML processing pipelines.

Similar resources

Parallelizing XML data-streaming workflows via MapReduce

In prior work it has been shown that the design of scientific workflows can benefit from a collection-oriented modeling paradigm which views scientific workflows as pipelines of XML stream processors. In this paper, we present approaches for exploiting data parallelism in XML processing pipelines through novel compilation strategies to the Map-Reduce framework. Pipelines in our approach consist...


Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce

Processing XML queries over big XML data using MapReduce has been studied in recent years. However, the existing works focus on partitioning XML documents and distributing XML fragments into different compute nodes. This attempt may introduce high overhead in XML fragment transferring from one node to another during MapReduce execution. Motivated by the structural join based XML query processin...


A Scalable XSLT Processing Framework based on MapReduce

The eXtensible Stylesheet Language Transformation (XSLT) is a de-facto standard for XML data transforming and extracting. Efficient processing of large amounts of XML data brings challenges to conventional XSLT processors, which are designed to run in a single machine context. To solve these data-intensive problems, MapReduce paradigm in the cloud computing domain has received a comprehensive a...


Distributed Processing of XPath Queries Using MapReduce

In this paper we investigate the problem of efficiently evaluating XPath queries over large XML data stored in a distributed manner. We propose a MapReduce algorithm based on a query decomposition which computes all expected answers in one MapReduce step. The algorithm can be applied over large XML data which is given either as a single distributed document or as a collection of small XML docum...


Device Neutral Pipelined Processing of XML Documents

XML languages are increasingly used for representations aimed to XML applications running on a variety of platforms. We have developed an architecture that automatically constructs combinations of XML processing pipelines that produce device specific representations. This is achieved by associating sets of elements and attributes of XML languages to processing pipelines according to platform ca...


Publication date: 2009